
ENH/API: Add count parameter to limit generator in Series, DataFrame, and DataFrame.from_records() #5898

Closed
wants to merge 1 commit

Conversation

tinproject
Contributor

When reading data from a generator-type collection, only the first count values are read.
This is a first step toward solving #2305: knowing the length of the data up front allows the memory to be allocated in advance.

  • Add a count parameter to Series, DataFrame, and DataFrame.from_records().
  • In DataFrame.from_records(), deprecate the existing nrows parameter. count is more general and refers only to the quantity of data items; it also exists in the numpy API (fromiter).
  • Some refactoring in DataFrame.from_records().
  • Tests added.
  • Release docs are still missing.

Add count parameter to Series, DataFrame, and DataFrame.from_records(). When reading data from a
generator-type collection, only the first count values are read.
Some refactoring in DataFrame.from_records(). Tests added.
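As a point of reference, numpy's fromiter (cited above as precedent) already accepts a count argument; a minimal illustration of the intended behavior:

```python
import numpy as np
from itertools import count

# An unbounded generator of squares; fromiter stops after `count` items
# and can pre-allocate the output because the size is known up front.
squares = (i * i for i in count())
arr = np.fromiter(squares, dtype=np.int64, count=5)
# arr is array([0, 1, 4, 9, 16]); the generator is never materialized as a list
```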
@jreback
Contributor

jreback commented Jan 10, 2014

can u give an actual use case of this?

@tinproject
Contributor Author

You can build random walks from an infinite random-number generator.
If you have a big file and a generator that yields each processed line, you can limit the number of lines read.

It's focused on solving #2305: to load data directly and in an efficient manner you need to know how much memory you will need, and generators and iterators are of indefinite length, so additional help is needed.

@jreback
Contributor

jreback commented Jan 10, 2014

not a big fan of adding a keyword to the constructors

maybe a better way is to allow data to be a callable, then you can embed islice if you wanted to
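A sketch of that alternative: no new keyword is needed if the caller caps the generator with itertools.islice before handing it to the constructor (plain Python shown here, without pandas):

```python
from itertools import count, islice

# Unbounded generator; islice caps consumption at 5 items,
# so a constructor would only ever see a finite iterator.
gen = (i * 2 for i in count())
limited = list(islice(gen, 5))
# limited == [0, 2, 4, 6, 8]
```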

@jtratner
Contributor

At the very least, better to make this a classmethod

@ghost

ghost commented Jan 18, 2014

me third, wrong approach. Supporting reduced-memory data loading from an iterator of known length
is not going to happen by extending existing methods.

viz. recent discussion in #2193, we like the idea done as a new class method.
Overview:

  1. take a count
  2. infer dtypes from first line (or a few if nans encountered)
  3. pre-allocate numpy arrays of known size and dtype
  4. consume the iterator and fill in the preallocated array
  5. pass the arrays or whatever works to reuse the arrays as the underlying
    data for a pandas data object.

I planned to try this myself for 0.14, but if you beat me to it so much the better.
If block manager confuses you, do 1-4+tests and we can take it from there.
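A rough sketch of steps 1-4 above (the function name is illustrative, not pandas API; the dtype is inferred from the first item only, without the NaN fallback mentioned in step 2):

```python
import numpy as np

def fromiter_prealloc(it, count):
    """Steps 1-4: take a count, infer the dtype from the first item,
    pre-allocate an ndarray of known size, then fill it from the iterator."""
    it = iter(it)
    first = next(it)                                      # step 2: infer dtype
    arr = np.empty(count, dtype=np.asarray(first).dtype)  # step 3: pre-allocate
    arr[0] = first
    n = 1
    for x in it:                                          # step 4: consume and fill
        if n == count:
            break
        arr[n] = x
        n += 1
    return arr[:n]  # trim (a view, no copy) if the iterator ran short
```

Step 5 would then hand the filled array (or arrays) to the block manager as the underlying data.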

This I'm closing.

@ghost ghost closed this Jan 18, 2014
@ghost

ghost commented Jan 18, 2014

p.s.
If you nail down a solid, reproducible way of measuring the memory allocation difference, even better.
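One reproducible approach (a sketch of my own, not from the thread) is the stdlib tracemalloc module; note that NumPy only reports its buffer allocations to tracemalloc in reasonably recent versions:

```python
import tracemalloc
import numpy as np

def peak_bytes(build):
    """Peak traced allocation while running build()."""
    tracemalloc.start()
    build()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

n = 100_000
# Current path: materialize a list of Python ints, then convert.
via_list = peak_bytes(lambda: np.array(list(range(n)), dtype=np.int64))
# Proposed path: consume the iterator straight into a pre-allocated array.
via_fromiter = peak_bytes(lambda: np.fromiter(range(n), dtype=np.int64, count=n))
print(via_list, via_fromiter)  # the list path should peak noticeably higher
```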

@jtratner
Contributor

side note - why is it useful to not create an intermediate list or ndarray? I assure you that even if you pass in a generator, it's eventually going to be converted into a list or ndarray, then have its types massaged, etc. The sugar of pandas does sacrifice some memory efficiency in loading and manipulating data. It's a tradeoff, and sometimes you might need to stick with numpy for really performance-critical parts.

@ghost

ghost commented Jan 18, 2014

@jtratner , are you asking this having read through the discussion in #2193?

@tinproject
Contributor Author

@y-p go for it. Currently I'm only a hobby programmer and don't have much time available to make it, but I'm happy to help with anything in my little spare time.

I have been thinking about this problem of memory consumption for months, and I came to the same task list as you; this PR actually matches task number 1, it was written for it.

I made this during the holidays. Because I didn't have enough time to write all the code, I filed GH5902 to try to express my ideas on the rest of the process. Unluckily I'm not a native English speaker and I don't express myself as well as I'd like to.

This PR is focused on the API changes: the addition of a count parameter. Maybe I aimed at too broad an inclusion of the parameter for the sake of consistency, but in any case you are going to need a count parameter, because generators/iterators are not sized objects: they have no __len__(), so the length must be given explicitly.
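The sizing point can be checked directly: generators expose no __len__, so a constructor cannot size an allocation from them (a small stdlib-only check):

```python
def gen():
    yield from range(3)

g = gen()
assert not hasattr(g, "__len__")      # len(g) would raise TypeError
assert hasattr([0, 1, 2], "__len__")  # lists, by contrast, are sized
```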

On measuring memory consumption:

Some time ago I tried IPython notebooks and %memit, but I prefer to use common sense.

  • Imagine you have a 1000-number generator that yields Python int objects. Pandas puts the whole generator into a list of Python int objects, then puts the objects of the list inside an ndarray. Result: twice the needed memory is used, 1000 x PythonIntSize + 1000 x dtypeSize + ndarray overhead.
  • The same generator, this time read in chunks of 100 objects (when reading in chunks you don't need to know the length of the generator). Consume a chunk's worth of objects from the generator, put them in a list, and then put the chunk into an ndarray. When all the chunks are in ndarray form, combine them into one single ndarray. Depending on the number of objects and their overhead this may use a little less or a little more memory than the previous point, but it is of the same order: 1 x 100 x PythonIntSize + 10 x (100 x dtypeSize + ndarray overhead) + 1000 x dtypeSize + ndarray overhead.
  • The same generator, one more time. This time you know the length of the data and the dtype, so you can allocate the ndarray and put the data yielded by the generator into it one by one. Result: only the needed memory is used, 1 x PythonIntSize + 1000 x dtypeSize + ndarray overhead.

If the generator yields more results than count, you can simply ignore them. If the generator is exhausted before count, resize the ndarray to the correct size. As this only shrinks the ndarray, there are no copy/move operations in memory, so no performance penalty.
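For comparison, numpy's fromiter handles these two cases a little differently from the behavior described above (assuming current NumPy): surplus items are simply left unconsumed in the generator, while an early-exhausted iterator raises instead of being trimmed:

```python
import numpy as np

gen = iter(range(10))
arr = np.fromiter(gen, dtype=np.int64, count=4)
# arr is array([0, 1, 2, 3]); items 4..9 are left unconsumed in `gen`
leftover = list(gen)

# When the iterator runs short of `count`, fromiter raises rather than trimming.
try:
    np.fromiter(iter(range(3)), dtype=np.int64, count=5)
    raised = False
except ValueError:
    raised = True
```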

This pull request was closed.